Self-Supervised Depth Learning for Urban Scene Understanding

Authors

  • Huaizu Jiang
  • Erik G. Learned-Miller
  • Gustav Larsson
  • Michael Maire
  • Gregory Shakhnarovich
Abstract

As an agent moves through the world, the apparent motion of scene elements is (usually) inversely proportional to their depth.¹ It is natural for a learning agent to associate image patterns with the magnitude of their displacement over time: as the agent moves, far away mountains don’t move much; nearby trees move a lot. This natural relationship between the appearance of objects and their motion is a rich source of information about the world. In this work, we train a deep network, using fully automatic supervision, to predict relative scene depth from single images. The depth training images are automatically derived from simple videos of cars moving through a scene, using classic depth-from-motion techniques and no human-provided labels. We show that this pretext task of predicting depth from a single image induces features in the network that result in large improvements over a network trained from scratch in a set of downstream tasks, including semantic segmentation, joint road segmentation and car detection, and monocular (absolute) depth estimation. In particular, our pre-trained model outperforms an ImageNet counterpart on the monocular depth estimation task. Unlike work that analyzes video paired with additional information about direction of motion, our agent learns from “raw egomotion” video recorded from cars moving through the world. Unlike methods that require videos of moving objects, we neither depend on, nor are disrupted by, moving objects in the video. Indeed, we can benefit from predicting depth in the videos associated with various downstream tasks, showing that we can adapt to new scenes in an unsupervised manner to improve performance. By doing so, we achieve consistently better results on several different urban scene understanding tasks, obtaining results that are competitive with state-of-the-art methods for monocular depth estimation.
¹ Strictly speaking, this statement is true only after one has compensated for camera rotation, individual object motion, and image position. We address these issues in the paper.
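
The inverse relationship between apparent motion and depth described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: it assumes a purely translating camera with rotation, independent object motion, and image position already compensated, as the footnote requires. The function name `relative_depth_from_flow` and the sample displacement values are hypothetical and chosen only for illustration.

```python
def relative_depth_from_flow(flow_magnitudes, eps=1e-6):
    """Convert displacement magnitudes (pixels) to relative depth in (0, 1].

    Assumption (not from the paper): depth is taken as proportional to
    1 / displacement, then scaled so the farthest point has depth 1.0.
    """
    # Clamp tiny displacements to avoid division by zero.
    inverse = [1.0 / max(m, eps) for m in flow_magnitudes]
    largest = max(inverse)
    return [v / largest for v in inverse]


# Hypothetical displacements: distant mountain, building, nearby tree.
flows = [0.5, 2.0, 8.0]
depths = relative_depth_from_flow(flows)
print(depths)
```

The output is only a relative ordering of depths: the element that moves least is treated as farthest, and the absolute scale stays unknown without the camera's speed and focal length.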

Similar resources

SceneNet: Understanding Real World Indoor Scenes With Synthetic Data

Scene understanding is a prerequisite to many high level tasks for any automated intelligent machine operating in real world environments. Recent attempts with supervised learning have shown promise in this direction but also highlighted the need for enormous quantity of supervised data — performance increases in proportion to the amount of data used. However, this quickly becomes prohibitive w...

Semantic Foggy Scene Understanding with Synthetic Data

This work addresses the problem of semantic foggy scene understanding (SFSU). Although extensive research has been performed on image dehazing and on semantic scene understanding with weather-clear images, little attention has been paid to SFSU. Due to the difficulty of collecting and annotating foggy images, we choose to generate synthetic fog on real images that depict weather-clear outdoor sc...

Integration of Perceptual Grouping and Depth

Different data acquisition methods are tailored to extracting particular characteristics from a scene, and by combining their results a more robust scene description can be created. A method to fuse perceptual groupings extracted from color-based segmentation and depth information from stereo using supervised classification is presented. The merging of data from these two acquisition modules all...

Self-Supervised Siamese Learning on Stereo Image Pairs for Depth Estimation in Robotic Surgery

INTRODUCTION Robotic surgery has become a powerful tool for performing minimally invasive procedures, providing advantages in dexterity, precision, and 3D vision, over traditional surgery. One popular robotic system is the da Vinci surgical platform, which allows preoperative information to be incorporated into live procedures using Augmented Reality (AR). Scene depth estimation is a prerequisi...

Online generation of scene descriptions in urban environments

The ability to extract a rich set of semantic workspace labels from sensor data gathered in complex environments is a fundamental prerequisite to any form of semantic reasoning in mobile robotics. In this paper we present an online system for the augmentation of maps of outdoor urban environments with such higher-order, semantic labels. The system employs a shallow supervised classification hie...

Journal:
  • CoRR

Volume abs/1712.04850

Publication date 2017